Extracting Semantic Classes and Morphosyntactic Features for English-Polish Machine Translation

Authors

  • Barbara Gawronska
  • Björn Erlendsson
  • Hanna Duczak
Abstract

This paper describes a procedure for the automatic extraction of certain noun and verb categories from Polish texts. The general goal is to construct a lexical database to be incorporated into a system for machine translation and multilingual generation of summaries. High-quality processing of inflectional languages like Polish requires quite elaborate lexical entries; it is therefore highly desirable to automate the process of lexicon construction, at least partially. However, purely statistical methods developed for languages with less elaborate inflectional systems do not perform especially well on Slavic languages. As primary cues for automatic subcategorization we used the inflectional morphemes that express the greatest number of semantico-syntactic functions. The crucial semantic category for noun classification was the degree of animacy. Morphosyntactically, this category is expressed by nominal suffixes and subject-verb agreement markers. The procedure for lexical extraction and classification was implemented in Delphi, and the system was trained to extract so-called superanimate nouns, i.e. nouns denoting male human beings or groups including both male and female humans. The usability of lexical extraction based on the co-occurrence of morphological features, rather than on the co-occurrence of whole word forms, is evaluated and discussed.
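
To make the idea of agreement-based cues concrete, the following is a minimal Python sketch. It is not the authors' Delphi implementation. It assumes one well-known property of Polish: third-person plural past-tense verbs end in "-li" with masculine-personal (virile) subjects and in "-ły" with all other subjects. The function names, the toy corpus, and the simplification that the subject noun directly precedes the verb are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def agreement_counts(sentences):
    """Count virile vs. non-virile verb endings after each candidate noun.

    Assumption (illustrative only): the subject is the token directly
    before a past-tense plural verb, and any token ending in "li" / "ły"
    in that position is such a verb.
    """
    counts = defaultdict(lambda: [0, 0])  # noun -> [virile, non-virile]
    for sentence in sentences:
        tokens = sentence.lower().rstrip(".").split()
        for noun, verb in zip(tokens, tokens[1:]):
            if verb.endswith("li"):
                counts[noun][0] += 1
            elif verb.endswith("ły"):
                counts[noun][1] += 1
    return counts


def superanimate_candidates(sentences, min_count=1):
    """Return nouns seen only with virile agreement, sorted alphabetically."""
    return sorted(
        noun
        for noun, (virile, other) in agreement_counts(sentences).items()
        if virile >= min_count and other == 0
    )


# Hypothetical toy corpus, not data from the paper.
corpus = [
    "Studenci przyszli.",    # male or mixed group of students: virile
    "Studentki przyszły.",   # female students: non-virile
    "Nauczyciele czytali.",  # teachers: virile
    "Psy szczekały.",        # dogs: non-virile
]
candidates = superanimate_candidates(corpus)
```

The sketch classifies nouns by the morphological features they co-occur with, not by whole word forms, which is the contrast the abstract draws. A realistic system would need proper tokenization, part-of-speech information, and the nominal suffix cues that the abstract also mentions.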

Related resources

Arabic-English Semantic Word Class Alignment to Improve Statistical Machine Translation

Clustering words is a widely used technique in statistical natural language processing. It requires syntactic, semantic, and contextual features. Especially, semantic clustering is gaining a lot of interest. It consists in grouping a set of words expressing the same idea or sharing the same semantic properties. In this paper, we present a new method to integrate semantic classes in a Statistica...

A Comparative Study of English-Persian Translation of Neural Google Translation

Many studies abroad have focused on neural machine translation and almost all concluded that this method was much closer to humanistic translation than machine translation. Therefore, this paper aimed at investigating whether neural machine translation was more acceptable in English-Persian translation in comparison with machine translation. Hence, two types of text were chosen to be translated...

Machine Learning of Syntactic Attachment from Morphosyntactic and Semantic Co-occurrence Statistics

The paper presents a novel approach to extracting dependency information in morphologically rich languages using co-occurrence statistics based not only on lexical forms (as in previously described collocation-based methods), but also on morphosyntactic and wordnet-derived semantic properties of words. Statistics generated from a corpus annotated only at the morphosyntactic level are used as fe...

Utilizing Semantic Equivalence Classes of Japanese Functional Expressions in Translation Rule Acquisition from Parallel Patent Sentences

In the “Sandglass” MT architecture, we identify the class of monosemous Japanese functional expressions and utilize it in the task of translating Japanese functional expressions into English. We employ the semantic equivalence classes of a recently compiled large scale hierarchical lexicon of Japanese functional expressions. We then study whether functional expressions within a class can be tra...

Boosting Statistical Machine Translation by Lemmatization and Linear Interpolation

Data sparseness is one of the factors that degrade statistical machine translation (SMT). Existing work has shown that using morphosyntactic information is an effective solution to data sparseness. However, fewer efforts have been made for Chinese-to-English SMT with using English morpho-syntactic analysis. We found that while English is a language with less inflection, using English lemmas in ...

Publication date: 2002